CIND 820: Stroke Prediction Project

Table of Contents


Introduction

Importing Libraries

Data Exploration

This dataset is highly imbalanced: only 249 patients suffered a stroke, while the remaining 4,861 did not.
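With 249 stroke cases among 5,110 patients, the positive class is roughly 4.9% of the data. A minimal sketch of how such a class balance check might look is shown here; the small DataFrame is a hypothetical stand-in for the real data, and only the `stroke` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset: 1 = stroke, 0 = no stroke.
df = pd.DataFrame({"stroke": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]})

# Absolute number of records per class.
class_counts = df["stroke"].value_counts()

# Fraction of records per class; a heavily skewed split signals imbalance.
class_share = df["stroke"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```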

This attribute carries no useful information beyond identifying patients.

Records with a missing BMI value were filled in with the mean BMI.
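The mean imputation described above could be sketched as follows. The column name `bmi` and the toy values are assumptions made for illustration, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "bmi" is an assumed column name.
df = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0, np.nan]})

# Series.mean() skips NaN by default, so this is the mean of the observed values.
bmi_mean = df["bmi"].mean()

# Replace every missing BMI with that mean.
df["bmi"] = df["bmi"].fillna(bmi_mean)

print(df)
```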

Only one record had its gender categorized as ‘Other’, so it was deleted.

Gender needs to be encoded as a binary variable.
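One way to drop the single ‘Other’ record and encode gender as a binary variable is sketched below. The column name `gender`, the label spellings, and the choice of Female = 0 and Male = 1 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column name and label spellings are assumed.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Other", "Female"],
    "stroke": [0, 1, 0, 0],
})

# Remove the rare 'Other' category so only two values remain.
df = df[df["gender"] != "Other"].reset_index(drop=True)

# Map the two remaining categories to 0/1 (the direction is an arbitrary choice).
df["gender"] = df["gender"].map({"Female": 0, "Male": 1})

print(df)
```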

Exploratory Data Analysis

Univariate Analysis

Categorical Features

The dataset contains 2,994 female and 2,115 male patients.

Bivariate Analysis

The stroke rate differs little between genders.
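Per-category stroke rates like those compared in this section can be computed with a group-by. This is a minimal sketch on hypothetical toy data; the column name `gender` is an assumption, and the same pattern would apply to any other categorical feature.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "stroke": [0, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of positives, i.e. the stroke rate.
stroke_rate = df.groupby("gender")["stroke"].mean()

print(stroke_rate)
```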

Young people rarely have hypertension, while the elderly frequently do, and hypertension can bring on a stroke. Our data, however, do not paint a clear picture for hypertension, because the dataset contains little information on patients with hypertension.

People who are married have a higher stroke rate.

People working in the private sector have a higher risk of stroke, while people who have never worked have a very low stroke rate.

There is not much difference in stroke rate between the two values of this attribute, so it may be a candidate for removal.

Strokes are more likely to occur in people over 60. A few outliers appear as strokes in people under the age of 20; since diet and lifestyle also affect stroke risk, these records are probably accurate. Another finding is that people over 60 also appear among those who do not experience strokes.
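To put a number on the age effect, the stroke rate can be compared across age bands. The sketch below uses hypothetical toy data; the column name `age` and the bin edges are assumptions, and the last band starts at 60 inclusive.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
df = pd.DataFrame({
    "age": [15, 34, 45, 61, 67, 72, 80, 58],
    "stroke": [0, 0, 0, 1, 0, 1, 1, 0],
})

# right=False makes each bin closed on the left: [0, 20), [20, 40), [40, 60), [60, 120).
age_bins = [0, 20, 40, 60, 120]
age_labels = ["<20", "20-39", "40-59", "60+"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, right=False)

# Stroke rate (fraction of positives) within each age band.
stroke_rate_by_age = df.groupby("age_group", observed=False)["stroke"].mean()

print(stroke_rate_by_age)
```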

Patients who had a stroke have an average glucose level above 100.

No clear relationship between BMI and the risk of stroke is observed in the data.

The chance of stroke differs little across smoking statuses.

Since the correlation check only accepts numerical variables, the categorical variables must be preprocessed first: each category is converted to an integer code (0 or 1 for binary variables). We use LabelEncoder from sklearn.preprocessing because it makes it easy to decode a label back later, after prediction, if required.
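The encode-then-decode round trip mentioned above might look like the following sketch. The column name `work_type` and its values are assumptions for illustration; `fit_transform`, `classes_`, and `inverse_transform` are standard `LabelEncoder` members.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column name and category labels are assumed.
df = pd.DataFrame({"work_type": ["Private", "Never_worked", "Private", "Self-employed"]})

encoder = LabelEncoder()

# Categories are sorted alphabetically and assigned integer codes 0..n-1.
df["work_type_encoded"] = encoder.fit_transform(df["work_type"])

# classes_ lists the original labels; the position of each label is its code.
print(list(encoder.classes_))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(df["work_type_encoded"])
print(list(decoded))
```

Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features with more than two categories, the integer codes imply an ordering that may not be meaningful.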

Data Processing & Modeling

All the predictor variables are mapped to an array x and the target variable to an array y. The target variable is the ‘stroke’ column.
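A minimal sketch of this split, assuming the data sits in a pandas DataFrame; the predictor columns shown here are hypothetical, and only the `stroke` target name comes from the text above.

```python
import pandas as pd

# Hypothetical toy data; predictor columns are illustrative only.
df = pd.DataFrame({
    "age": [67, 45, 80],
    "hypertension": [0, 0, 1],
    "stroke": [1, 0, 1],
})

# Every column except the target becomes a predictor.
x = df.drop(columns=["stroke"]).to_numpy()

# The 'stroke' column is the target.
y = df["stroke"].to_numpy()

print(x.shape, y.shape)
```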

Handling Imbalanced Data
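The resampling technique used in this project is not stated in this section. As one common option, and purely as an illustration, random oversampling of the minority class can be sketched with pandas alone; the toy data and the `age` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a 6:2 class imbalance.
df = pd.DataFrame({
    "age": [30, 41, 52, 63, 74, 85, 66, 79],
    "stroke": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["stroke"] == 0]
minority = df[df["stroke"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine the classes and shuffle the rows.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["stroke"].value_counts())
```

Resampling of this kind is normally applied to the training split only, so that duplicated minority rows do not leak into the test set.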

Model

Conclusions